{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "8990a030-4edd-4362-aef2-7a1357257383",
   "metadata": {},
   "source": [
    "# L4: Preprocessing PDFs and Images"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cc60f541-0ecd-4fa6-b28f-ef8fbe1f28fc",
   "metadata": {},
   "source": [
    "<p style=\"background-color:#fff6e4; padding:15px; border-width:3px; border-color:#f5ecda; border-style:solid; border-radius:6px\"> ⏳ <b>Note <code>(Kernel Starting)</code>:</b> This notebook takes about 30 seconds to be ready to use. You may start and watch the video while you wait.</p>\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "74693f39-ab39-49b0-a4d6-50b829e941b8",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Warning control\n",
    "import warnings\n",
    "warnings.filterwarnings('ignore')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "30b979ec-03a8-4296-af23-26fc86384d8e",
   "metadata": {},
   "outputs": [],
   "source": [
    "from unstructured_client import UnstructuredClient\n",
    "from unstructured_client.models import shared\n",
    "from unstructured_client.models.errors import SDKError\n",
    "\n",
    "from unstructured.partition.html import partition_html\n",
    "from unstructured.partition.pdf import partition_pdf\n",
    "\n",
    "from unstructured.staging.base import dict_to_elements"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "caee5a38-3123-4608-8a58-60cd01675781",
   "metadata": {},
   "outputs": [],
   "source": [
    "from Utils import Utils\n",
    "utils = Utils()\n",
    "\n",
    "DLAI_API_KEY = utils.get_dlai_api_key()\n",
    "DLAI_API_URL = utils.get_dlai_url()\n",
    "\n",
    "s = UnstructuredClient(\n",
    "    api_key_auth=DLAI_API_KEY,\n",
    "    server_url=DLAI_API_URL,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2df4e009-6125-4039-8b22-6ec85dfb155d",
   "metadata": {},
   "source": [
    "<p style=\"background-color:#fff6ff; padding:15px; border-width:3px; border-color:#efe6ef; border-style:solid; border-radius:6px\"> 💻 &nbsp; <b>Access Utils File and Helper Functions:</b> To access helper functions and other related files for this notebook, 1) click on the <em>\"View\"</em> option on the top menu of the notebook and then 2) click on <em>\"File Browser\"</em>. For more help, please see the <em>\"Appendix - Tips and Help\"</em> Lesson.</p>\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b8c88125-aafc-4adc-855a-694adfdb9180",
   "metadata": {},
   "source": [
    "## Example Document: News in PDF and HTML"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cd80195c-ded9-41db-ab5b-312d2cc4bd5c",
   "metadata": {},
   "source": [
    "### View the content of the files\n",
    "- <a href=\"example_files/el_nino.pdf\">El Nino (View PDF) -- Click Here</a>\n",
    "- <a href=\"example_files/el_nino.html\">El Nino (View HTML) -- Click Here</a>\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d3d6ec6b-e2d1-445e-a4df-0e6e75ef2df3",
   "metadata": {},
   "outputs": [],
   "source": [
    "from IPython.display import Image\n",
    "Image(filename=\"images/el_nino.png\", height=600, width=600) "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1f42c528-956d-4348-b4e5-a34532e7145f",
   "metadata": {},
   "source": [
    "## Process the Document as HTML"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "af9e421c-77e0-4ad8-9023-19c00f047cf5",
   "metadata": {},
   "outputs": [],
   "source": [
    "filename = \"example_files/el_nino.html\"\n",
    "html_elements = partition_html(filename=filename)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ae8f4a52-b005-4f2e-91c9-19ac6c030247",
   "metadata": {},
   "outputs": [],
   "source": [
    "for element in html_elements[:10]:\n",
    "    print(f\"{element.category.upper()}: {element.text}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a2d3a088-d60d-447c-8ae8-56c1781e8af3",
   "metadata": {},
   "source": [
    "## Process the Document with Document Layout Detection"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f8f0a8e4-86a4-4cc7-928e-3a6f4071e71a",
   "metadata": {},
   "outputs": [],
   "source": [
    "filename = \"example_files/el_nino.pdf\"\n",
    "pdf_elements = partition_pdf(filename=filename, strategy=\"fast\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b4651212-adcc-4b0b-bc76-28f2df3479a8",
   "metadata": {},
   "outputs": [],
   "source": [
    "for element in pdf_elements[:10]:\n",
    "    print(f\"{element.category.upper()}: {element.text}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "23a02748-d66f-40ee-842d-8a2ffb5a4684",
   "metadata": {},
   "outputs": [],
   "source": [
    "with open(filename, \"rb\") as f:\n",
    "    files=shared.Files(\n",
    "        content=f.read(),\n",
    "        file_name=filename,\n",
    "    )\n",
    "\n",
    "req = shared.PartitionParameters(\n",
    "    files=files,\n",
    "    strategy=\"hi_res\",\n",
    "    hi_res_model_name=\"yolox\",\n",
    ")\n",
    "\n",
    "try:\n",
    "    resp = s.general.partition(req)\n",
    "    dld_elements = dict_to_elements(resp.elements)\n",
    "except SDKError as e:\n",
    "    print(e)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "27f22ac3-ae60-459d-9a3b-5f1ecdd750b8",
   "metadata": {},
   "outputs": [],
   "source": [
    "for element in dld_elements[:10]:\n",
    "    print(f\"{element.category.upper()}: {element.text}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c5ae0671-c7c6-4578-9efb-e7824958607b",
   "metadata": {},
   "outputs": [],
   "source": [
    "import collections"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d73f5235-e7aa-4e3f-b02e-c00b52b4692d",
   "metadata": {},
   "outputs": [],
   "source": [
    "len(html_elements)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "39b5c15c-3abd-493a-a8f2-389cea836f54",
   "metadata": {},
   "outputs": [],
   "source": [
    "html_categories = [el.category for el in html_elements]\n",
    "collections.Counter(html_categories).most_common()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "008e072c-f7de-4096-bf74-7e1dfef6a939",
   "metadata": {},
   "outputs": [],
   "source": [
    "len(dld_elements)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f28733a7-6325-4953-97eb-db44a98b6649",
   "metadata": {},
   "outputs": [],
   "source": [
    "dld_categories = [el.category for el in dld_elements]\n",
    "collections.Counter(dld_categories).most_common()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "70f75c9d-98d4-45e2-8bf2-74493b24a81e",
   "metadata": {},
   "source": [
    "## Work With Your Own Files"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1c109639-ca90-48bb-bcb2-734b4496ef81",
   "metadata": {},
   "outputs": [],
   "source": [
    "import panel as pn\n",
    "#import param\n",
    "from Utils import upld_file\n",
    "pn.extension()\n",
    "\n",
    "upld_widget = upld_file()\n",
    "pn.Row(upld_widget.widget_file_upload)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dccc5db0-19f9-469f-a970-dcfa6bade9f6",
   "metadata": {},
   "source": [
    "<p style=\"background-color:#fff6e4; padding:15px; border-width:3px; border-color:#f5ecda; border-style:solid; border-radius:6px\"> 🖥 &nbsp; <b>Note:</b> If the file upload interface isn't functioning properly, the issue may be related to your browser version. In such a case, please ensure your browser is updated to the latest version, or try using a different browser.</p>\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e5e4febd-0b02-4098-a3a6-4d2c6e4a38a1",
   "metadata": {},
   "outputs": [],
   "source": [
    "!ls ./example_files"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "64c1d29c-faad-451c-b2d7-f374b253d5ae",
   "metadata": {},
   "source": [
    "<p style=\"background-color:#fff6ff; padding:15px; border-width:3px; border-color:#efe6ef; border-style:solid; border-radius:6px\"> 💻 &nbsp; <b>Uploading Your Own File - Method 2:</b> To upload your own files, you can also 1) click on the <em>\"View\"</em> option on the top menu of the notebook and then 2) click on <em>\"File Browser\"</em>. Then 3) click on <em>\"Upload\"</em> button to upload your files. For more help, please see the <em>\"Appendix - Tips and Help\"</em> Lesson.</p>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "881e19ea-5fcf-4932-9753-8d130f91f4ba",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "74c82e97-a40e-490a-b978-5fc4f9fb1214",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c707dd0e-a86c-4e4d-8426-18c404b23463",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a4513c9d-20e9-41c2-a309-2fbb22f4279e",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6610fe0e-c23e-4a23-a52a-d53d61a93055",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6df37882-4192-4f68-a06d-c58c93bc2b32",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fa30dd13-2881-46b5-8c33-606ae2702921",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
